deductive output
PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference
February 18, 2025 A BSTRACT We show that Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a foundational model whose deductive outputs are invariant tensors up to a small perturbation. PLDR-LLM learns a singularity condition for the deductive outputs that enable the once-inferred energy-curvature tensor G LMto replace the deep neural network of power law graph attention (PLGA) generating the deductive outputs at inference. We demonstrate that a cache for G LM(G-cache) and KV -cache can be implemented in a straightforward manner to improve the inference time. The invariance and generalizable nature of deductive outputs is at a very high fidelity where deductive outputs have same RMSE and determinant values up to 15 decimal places after caching, and zero-shot benchmark scores remain unchanged. Ablation studies show that learned deductive outputs have distinct loss and accuracy characteristics from models pretrained with transferred, randomly initialized or identity tensors as a constant tensor operator and an LLM with scaled-dot product attention (SDP A) is a special case of PLDR-LLM where G LMis predefined as identity. The observed invariance characteristic introduces a novel asymmetry between training and inference phases with caching. We outline observed common characteristics of the deductive outputs for the learned singularity condition. We provide an implementation of a training and inference framework for PLDR-LLM with KV -cache and G-cache. 1 Introduction Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a novel language model architecture with well-defined deductive and inductive outputs [Gokden, 2024]. It is composed of deep layers of decoders with multi-headed Power Law Graph Attention (PLGA) [Gokden, 2021, 2019]. The deductive outputs are intended to observe and regularize the model, while the inductive output is the next-token prediction of a language model. PLGA is a series of non-linear and linear transformations that attend to an input sentence that can be considered as a weighted graph G = ( V, E) where nodes are the tokens densely represented by an N-dimensional embedding space. The PLGA learns a metric tensor A LMof the embedding space after applying a custom fully connected layer and iSwiGLU, a positive semi-definite activation function, to the output A of a deep residual network of gated linear units (GLUs) whose input is a density matrix operator derived from the query.
PLDR-LLM: Large Language Model from Power Law Decoder Representations
We present the Large Language Model from Power Law Decoder Representations (PLDR-LLM), a language model that leverages non-linear and linear transformations through Power Law Graph Attention mechanism to generate well-defined deductive and inductive outputs. We pretrain the PLDR-LLMs of varying layer sizes with a small batch size of 32 and $\sim$8B tokens from the RefinedWeb dataset, and show that they achieve competitive performance in zero-shot and few-shot settings compared to scaled dot-product LLMs of similar model size reported in the literature. We show that deductive outputs of PLDR-LLMs can be used to compare model characteristics or improve the performance by introducing the Directed Acyclic Graph (DAG) loss as a metric and regularizer. Our results indicate that the initial maximum learning rate and warm-up steps have a lasting impact on deductive outputs throughout the pretraining. We provide a detailed description of PLDR-LLM architecture, its implementation and the pretraining procedure.